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Abstract 

We consider distributed convex optimization problems originated from sample average ap¬ 
proximation of stochastic optimization, or empirical risk minimization in machine learning. We 
assume that each machine in the distributed computing system has access to a local empirical 
loss function, constructed with i.i.d. data sampled from a common distribution. We propose a 
communication-efficient distributed algorithm to minimize the overall empirical loss, which is 
the average of the local empirical losses. The algorithm is based on an inexact damped Newton 
method, where the inexact Newton steps are computed by a distributed preconditioned con¬ 
jugate gradient method. We analyze its iteration complexity and communication efficiency for 
minimizing self-concordant empirical loss functions, and discuss the results for distributed ridge 
regression, logistic regression and binary classification with a smoothed hinge loss. In a standard 
setting for supervised learning, the required number of communication rounds of the algorithm 
does not increase with the sample size, and only grows slowly with the number of machines. 


1 Introduction 

Many optimization problems in data science (including statistics, machine learning, data mining, 
etc.) are formulated with a large amount of data as input. They are typically solved by iterative 
algorithms which need to access the whole dataset or at least part of it during each iteration. With 
the amount of data we collect and process growing at a fast pace, it happens more often that 
the dataset involved in an optimization problem cannot fit into the memory or storage of a single 
computer (machine). To solve such “big data” optimization problems, we need to use distributed 
algorithms that rely on inter-machine communication. 

In this paper, we focus on distributed optimization problems generated through sample average 
approximation (SAA) of stochastic optimization problems. Consider the problem 

minimize Kz[4>{w, z)], (1) 

uiSK'^ 

where z is a random vector whose probability distribution is supported on a set Z C M^, and the 
cost function cj) : x —>■ R is convex in w for every z £ Z. In general, evaluating the expected 
objective function with respect to 2 ; is intractable, even if the distribution is given. The idea of 
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SAA is to approximate the solution to m by solving a deterministic problem defined over a large 
number of i.i.d. (independent and identically distributed) samples generated from the distribution 
of z (see, e.g., [Ml Chapter 5]). Suppose our distributed computing system consists of m machines, 
and each has access to n samples Zi^i, ..., Zi^n, for i = 1,... ,m. Then each machine can evaluate a 
local empirical loss function 

1 ” 

fi{w) =*' - ^ Zij), i = 

Our goal is to minimize the overall empirical loss defined with all mn samples: 

- m - m n 

f{w) fi{w) = — 

m van 

2=1 2=1 j = l 

In machine learning applications, the probability distribution of is usually unknown, and 
the SAA approach is referred as empirical risk minimization (ERM). As a concrete example, we 
consider ERM of linear predictors for supervised learning. In this case, each sample has the form 
Zi,j = {xij,yij) G where Xij G is a feature vector and yij can be a target response in M 

(for regression) or a discrete label (for classification). Examples of the loss function include 

• linear regression: x G y G M, and 4>{w, {x,y)) = {y — nFx)'^. 

• logistic regression: x G y G {+1, —1}, and (t>{w, (x, y)) = log(l + exp(—y(ri;'^x))). 

• hinge loss: X G y G {+1,—1}, and ^(u), (x, y)) = max |0, 1 —y(u)'^x)|. 

Eor stability and generalization purposes, we often add a regularization term (A/2)||t(;||2 to make 
the empirical loss function strongly convex. More specifically, we modify the definition of fi{w) as 


n 

fi{w) - Y + 


i=i 


w 


|2 

l2) 


i = 1,. 


, m. 


(3) 


Eor example, when cj) is the hinge loss, this formulation yields the support-vector machine m- 
Since the functions fi{w) can be accessed only locally, we consider distributed algorithms that 
alternate between a local computation procedure at each machine, and a communication round 
involving simple map-reduce type of operations HUES]. Compared with local computation at each 
machine, the cost of inter-machine communication is much higher in terms of both speed/delay 
and energy consumption (e.g., mmi thus it is often considered as the bottleneck for distributed 
computing. Our goal is to develop communication-efficient distributed algorithms, which try to 
use a minimal number of communication rounds to reach certain precision in minimizing f{w). 


1.1 Communication efficiency of distributed algorithms 

We assume that each communication round requires only simple map-reduce type of operations, 
such as broadcasting a vector in to the m machines and computing the sum or average of m 
vectors in M'^. Typically, if a distributed iterative algorithm takes T iterations to converge, then it 
communicates at least T rounds (usually one or two communication rounds per iteration). There¬ 
fore, we can measure the communication efficiency of a distributed algorithm by its iteration com¬ 
plexity r(e), which is the number of iterations required by the algorithm to find a solution wt such 
that /{wt) - f{Wi,) < e. 

For a concrete discussion, we make the following assumption: 
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Assumption A. The function f : ^ W is twice continuously differentiable, and there exist 

constants L > A > 0 such that 

XI < f {w) < LI, 

where f''{w) denotes the Hessian of f at w, and I is the dx d identity matrix. 

Functions that satisfy Assumption are often called L-smooth and A-strongly convex. The 
value K = L/X > 1 is called the condition number of /, which is a key quantity in characterizing 
the complexity of iterative algorithms. We focus on ill-conditioned cases where k 3> 1. 

A straightforward approach for minimizing f{w) is distributed implementation of the classical 
gradient descent method. More specifically, at each iteration k, each machine computes the local 
gradient fl{wk) E and sends it to a master node to compute f'{wk) = (1/m) YllLi The 

master node takes a gradient step to compute w^+i, and broadcasts it to each machine for the next 
iteration. The iteration complexity of this method is the same as the classical gradient method: 
0(«;log(l/e)), which is linear in the condition number k {e.g., [34] i. If we use accelerated gradient 
methods [Ml EH EH] , then the iteration complexity can be improved to 0{y/lI\og{l/e)). 

Another popular technique for distributed optimization is to use the alternating direction 
method of multipliers (ADMM); see, e.g., [H Section 8]. Under the assumption that each local 
function /j is L-smooth and A-strongly convex, the ADMM approach can achieve linear conver¬ 
gence, and the best known complexity is 0{y/K\og{l/e)) [IT]. This turns out to be the same order as 
for accelerated gradient methods. In this case, ADMM can actually be considered as an accelerated 
primal-dual first-order method; see the discussions in [ini Section 4]. 

The polynomial dependence of the iteration complexity on the condition number can be unsat- 
ifactory. For machine learning applications, both the precision e and the regularization parameter A 
should decrease while the overall sample size mn increases, typically on the order of Q{1/y/rrm) 
{e.g., mm)- This translates into the condition number n being Q{y/mn). In this case, the iteration 
complexity, and thus the number of communication rounds, scales as for both accelerated 

gradient methods and ADMM (with careful tuning of the penalty parameter). This suggests that 
the number of communication rounds grows with the total sample size. 

Despite the rich literature on distributed optimization (e.g., [HlEHlEKIKinKISlEglEaile]) , most 
algorithms involve high communication cost. In particular, their iteration complexity have similar 
or worse dependency on the condition number as the methods discussed above. It can be argued that 
the iteration complexity 0{y/K\og{l/e)) cannot be improved in general for distributed first-order 
methods — after all, it is optimal for centralized first-order methods under the same assumption 
that f{w) is L-smooth and A-strongly convex [331134] . Thus in order to obtain better communication 
efficiency, we need to look into further problem structure and/or alternative optimization methods. 
And we need both in this paper. 

First, we note that the above discussion on iteration complexity does not exploit the fact that 
each function /j is generated by, or can be considered as, SAA of a stochastic optimization problem. 
Since the data Zjj are i.i.d. samples from a common distribution, the local empirical loss functions 
fi{w) = (1/n) ^{'^1 ^i,j) be similar to each other if the local sample size n is large. Under 

this assumption, Zhang et al. [52] studied a one-shot averaging scheme that approximates the 
minimizer of function / by simply averaging the minimizers of /j. For a hxed condition number, 
the one-shot approach is communication efficient because it achieves optimal dependence on the 
overall sample size mn (in the sense of statistical lower bounds). But their conclusion doesn’t allow 
the regularization parameter A to decrease to zero as n goes to infinity (see discussions in [47|h 
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Exploiting the stochastic nature alone seems not enough to overcome ill-conditioning in the 
regime of first-order methods. This motivates the development of distributed second-order methods. 
Recently, Shamir et al. m proposed a distributed approximate Newton-type (DANE) method. 
Their method takes advantage of the fact that, under the stochastic assumptions of SAA, the 
Hessians similar to each other. Eor quadratic loss functions, DANE is shown 

to converge in 0((L/A)^n“^ log(l/e)) iterations with high probability, where the notation O(-) 
hides additional logarithmic factors involving m and d. If A ~ Ijy/mn as in machine learning 
applications, then the iteration complexity becomes 0(mlog(l/e)), which scales linearly with the 
number of machines m, not the total sample size mn. However, the analysis in m does not 
guarantee that DANE has the same convergence rate on non-quadratic functions. 

1.2 Outline of our approach 

In this paper, we propose a communication-efficient distributed second-order method for minimizing 
the overall empirical loss f{w) defined in ([2|). Our method is based on an inexact damped Newton 
method. Assume f{w) is strongly convex and has continuous second derivatives. In the exact 
damped Newton method (e.g., [Ml Section 4.1.5]), we first choose an initial point wq G and 
then repeat 

Wk+i=Wk-——f; -A: = 0,1,2,... , (4) 

1 6{wk) 

where Awk and 5{wk) are the Newton step and the Newton decrement, respectively, defined as 
Awk = [f"iwk)]~^fiwk) , 

^(wk) = \/f'{wk)'^[f"{wk)]-^f'{wk) = {AwkYf''{wk)Awk ■ ( 5 ) 

Since / is the average of /i,..., /m, its gradient and Hessian can be written as 

- m - m 

f{wk) =—^fliwk), f'iwk) =—^fiiwk). (6) 

2=1 2=1 

In order to compute Awk in a distributed setting, the naive approach would require all the 
machines to send their gradients and Hessians to a master node (say machine 1). However, the 
task of transmitting the Hessians (which are dxd matrices) can be prohibitive for large dimensions d. 
A better alternative is to use the conjugate gradient (CG) method to compute Awk as the solution 
to a linear system f"{wk)Awk = f'{wk)- Each iteration of the CG method requires a matrix-vector 
product of the form 

.. m 

f'{wk)v = — ^ fi{Wk)v, 

2=1 

where v is some vector in M^. More specifically, the master node can broadcast the vector v to 
each machine, each machine computes f['{wk)v G locally and sends it back to the master node, 
which then forms the average f"{wk)v and performs the CG update. Due to the iterative nature of 
the CG method, we can only compute the Newton direction and Newton decrement approximately, 
especially with limited number of communication rounds. 

The overall method has two levels of loops: the outer-loop of the damped Newton method, and 
the inner loop of the CG method for computing the inexact Newton steps. A similar approach (using 
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a distributed truncated Newton method) was proposed in [SH [26] for ERM of linear predictors, 
and it was reported to perform very well in practice. However, the total number of CG iterations 
(each takes a round of communication) may still be high. 

First, consider the outer loop complexity. It is well-known that Newton-type methods have 
asymptotic superlinear convergence. However, in classical analysis of Newton’s method (e.g., [9l 
Section 9.5.3]), the number of steps needed to reach the superlinear convergence zone still depends 
on the condition number; more specifically, it scales quadratically in k. To solve this problem, we 
resort to the machinery of self-concordant functions |36[ I34j. For self-concordant empirical losses, 
we show that the iteration complexity of the inexact damped Newton method has a much weaker 
dependence on the condition number. 

Second, consider the inner loop complexity. The convergence rate of the CG method also 
depends on the condition number k: it takes 0{^/ll\og{l/e)) CG iterations to compute an e- 
precise Newton step. Thus we arrive at the dilemma that the overall complexity of the CG- 
powered inexact Newton method is no better than accelerated gradient methods or ADMM. To 
overcome this difficulty, we exploit the stochastic nature of the problem and propose to use a 
preconditioned CG (PCG) method for solving the Newton system. Roughly speaking, if the local 
Hessians fi{wk), ■ ■ ■, fmiwk) are “similar” to each other, then we can use any local Hessian f['{wk) 
as a preconditioner. Without loss of generality, let P = fi{wk) -|- fJ-I, where /r is an estimate of the 
spectral norm \\fi{wk) — f”iwk)\\ 2 - Then we use CG to solve the pre-conditioned linear system 

P~^f"{wk)Awk = P~^f'{wk), 

where the preconditioning (multiplication by P~^) can be computed locally at machine 1 (the 
master node). The convergence rate of PCG depends on the condition number of the matrix 
P~^f"{wk), which is close to 1 if the spectral norm \\f1{wk) — f''{wk )\\2 is small. 

To exactly characterize the similarity between /'{{wk) and f"{wk), we rely on stochastic anal¬ 
ysis in the framework of SAA or ERM. We show that with high probability, \\fi{wk) — f"{'Wk )\\2 
decreases as Oi^sfdfn) in general, and 0(y^l/n) for quadratic loss. Therefore, when n is large, 
the preconditioning is very effective and the PCG method converges to sufficient precision within a 
small number of iterations. The stochastic assumption is also critical for obtaining an initial point 
wq which further brings down the overall iteration complexity. 

Combining the above ideas, we propose and analyze an algorithm for Distributed Self-Concordant 
Optimization (DiSCO, which also stands for Distributed SeCond-Order method, or Distributed 
Stochastic Convex Optimization). We show that several popular empirical loss functions in ma¬ 
chine learning, including ridge regression, regularized logistic regression and a (new) smoothed 
hinge loss, are actually self-concordant. For ERM with these loss functions, Tabled) lists the num¬ 
ber of communication rounds required by DiSCO and several other algorithms to find an e-optimal 
solution. As the table shows, the communication cost of DiSCO weakly depends on the number of 
machines m and on the feature dimension d, and is independent of the local sample size n (exclud¬ 
ing logarithmic factors). Comparing to DANE [47], DiSCO not only improves the communication 
efficiency on quadratic loss, but also handles non-quadratic classification tasks. 

The rest of this paper is organized as follows. In Section [2] we review the definition of self- 
concordant functions, and show that several popular empirical loss functions used in machine 
learning are either self-concordant or can be well approximated by self-concordant functions. In 
Section [3] we analyze the iteration complexity of an inexact damped Newton method for minimizing 
self-concordant functions. In Section 01 we show how to compute the inexact Newton step using 
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Algorithm 

Number of Communication Rounds O(-) 

Ridge Regression 
(quadratic loss) 

Binary Classification 
(logistic loss, smoothed hinge loss) 

Accelerated Gradient 

(mn)^/^ log(l/e) 

(mn)^/^ log(l/e) 

ADMM 

(mn)^/^ log(l/e) 

(mn)^/'^ log(l/e) 

DANE g7| 

m log(l/e) 

(mn)^/^ log(l/e) 

DiSCO (this paper) 

777,1/4 iog(l/e) 

7773/4^1/4 _|_ ^l/4^1/4 


Table 1: Communication efficiency of several distributed algorithms for ERM of linear predictors, 
when the regularization parameter A in ([JD is on the order of 1/ y/mn. All results are deterministic 
or high probability upper bounds, except that the last one, DiSCO for binary classification, is a 
bound in expectation (with respect to the randomness in generating the i.i.d. samples). For DiSCO, 
the dependence on e can be improved to loglog(l/e) with superlinear convergence. 


a distributed PCG method, describe the overall DiSCO algorithm, and discuss its communication 
complexity. In Section [5l we present our main theoretical results based on stochastic analysis, and 
apply them to linear regression and classification. In SectionlU we report experiment results to illus¬ 
trate the advantage of DiSCO in communication efficiency, compared with other algorithms listed 
in Table [H Finally, we discuss the extension of DiSCO to distributed minimization of composite 
loss functions in Section [71 and conclude the paper in Section [HI 

2 Self-concordant empirical loss 

The theory of self-concordant functions were developed by Nesterov and Nemirovski for the analysis 
of interior-point methods [36]. Roughly speaking, a function is called self-concordant if its third 
derivative can be controlled, in a specific way, by its second derivative. Suppose the function 
/ : —>■ M has continuous third derivatives. We use f"{w) G to denote its Hessian at m G M'^, 
and use f'"{w)[u] G to denote the limit 

f"{w)[u] + tu) - f"iw)). 

Definition 1. A convex function f : R*^ —>■ R is self-concordant with parameter Mf if the inequality 

holds for any w G dom(/) and u G R*^. In particular, a self-concordant function with parameter 2 
is called standard self-concordant. 

The reader may refer to the books [361 [M] for detailed treatment of self-concordance. In 
particular, the following lemma [341 Corollary 4.1.2] states that any self-concordant function can 
be rescaled to become standard self-concordant. 

Lemma 1 . If a function f is self-concordant with parameter Mf, then —^f is standard self- 
concordant (with parameter 2). 
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In the rest of this section, we show that several popular regularized empirical loss functions for 
linear regression and binary classification are either self-concordant or can be well approximated 
by self-concordant functions. 

First we consider regularized linear regression (ridge regression) with 


1 ^ 

= jV ~ + 


A, 




i=l 


To simplify notation, here we use a single subscript i running from 1 to N = mn, instead of 
the double subscripts {i, j} used in the introduction. Since / is a quadratic function, its third 
derivatives are all zero. Therefore, it is self-concordant with parameter 0, and by definition is also 
standard self-concordant. 

For binary classification, we consider the following regularized empirical loss function 


N 




2 = 1 


(7) 


where Xi £ X C yi € {— 1 , 1 }, and (p : 


is a convex surrogate function for the binary loss 


function which returns 0 if = sign(t(;^Xi) and 1 otherwise. We further assume that the elements 
of X are bounded, that is, we have sup 3 ,g_:^^ ||x ||2 < B for some finite B. Under this assumption, the 
following lemma shows that the regularized loss i{w) is self-concordant. 

Lemma 2. Assume that 7 > 0 and there exist Q > 0 and a £ [0,1) such that \ip'"{t)\ < 
for every t £ M. Then: 

(a) The function £{w) defined by equation ([7]) is self-concordant with parameter ^ 1 ^ 2 +^ • 

(b) The scaled function f{w) = £{w) is standard self-concordant. 

Proof. We need to bound the third derivative of i appropriately. Using equation © and the 
assumption on ip, we have 


1 


N 


{£'"{w)[u])u\ < \p"'{yjw'^Xi){yiU^Xj 


« Q 


2=1 

N 


- jV ^ “ {B\\u\\2] 

2 = 1 

( 22 ) 

< 5 ^+ 2 “ 


1 - 1-20 


A i=i 7 


1 - 1-20 


(m) 


< {u'^£"{w)u) “(||w||2)^+^". 

In the above derivation, inequality (i) uses the property that \yi\ = 1 and \u'^Xi\ < i?||ii|| 2 , inequality 
(ii) uses Holder’s inequality and concavity of and inequality (iii) uses the fact that the 

additional regularization term in £{w) is convex. 

Since £ is 7 -strongly convex, we have u^£"{w)u > 7 ||rt|| 2 - Thus, we can upper bound ||tt ||2 by 
Idlb < j~^^‘^{u'^£"{w)u)^^'^. Substituting this inequality into the above upper bound completes 
the proof of part (a). Given part (a), part (b) follows immediately from Lemma [TJ □ 
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Figure 1: Smoothed hinge loss (pp with p = 3,5, 10, 20. 


It is important to note that the self-concordance of i essentially relies on the regularization 
parameter 7 being positive. If 7 = 0, then the function will no longer be self-concordant, as 
pointed out by Bach [3] on logistic regression. Since we have the freedom to choose p. Lemma [2] 
handles a broad class of empirical loss functions. Next, we take the logistic loss and a smoothed 
hinge loss as two concrete examples. 


Logistic regression For logistic regression, we minimize the objective function ([7]) where p is 
the logistic loss: p(t) = log(l -|- e“*). We can calculate the second and the third derivatives of p{t): 


p"{t) 


p"'{t) 


(e* + 1)2 ’ 

(et + l)3 i + 


Since < 1 for all t E M, we conclude that \p'''{t)\ < p"{t) for all t E M. This implies that 

the condition in Lemma [2] holds with Q = 1 and a = 0. Therefore, the regularized empirical loss 
£{w) is self-concordant with parameter and the scaled loss function f{w) = {B'^/ 

is standard self-concordant. 


Smoothed hinge loss In classification tasks, it is sometimes more favorable to use the hinge 
loss p{t) = max{0,1 — t} than using the logistic loss. We consider a family of smoothed hinge loss 
functions Pp parametrized by a positive number p > 3. The function is defined by 


1 - Pul - f 

2 p-i 

3 _ _ y- I h+(p-3)/(p-l))P 

2 p-1 p(p-l) 

p+1 t I 1/1 f\2 

(2-t)P 

p(p-l) 

0 
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for t < 


P-3 

p—1 ’ 


for -1^ <1-^ 

for 1 - ^ < t < 1 , 
for 1 < t < 2 , 


for t > 2. 


Pp{t) = < 


( 8 ) 















Algorithm 1: Inexact damped Newton method 

input: initial point wq and specification of a nonnegative sequence {cfc}. 
repeat for A: = 0,1,2,... 

1. Find a vector Vk such that \\f''{vjk)vk — f'iwk )\\2 < Cfc- 

2 . Compute 4 = ^Jvlf"{wk)vk and update Wk+i = Wk - j^Vk- 
until a stopping criterion is satisfied. 


We plot the functions ipp for p = 3,5, 10, 20 on Figured) As the plot shows, ipp{t) is zero for t > 2, 
and it is a linear function with unit slope for t < ~^E^- These two linear zones are connected by 
three smooth non-linear segments on the interval [— , 2]. 

The smoothed hinge loss <pp satishes the condition of Lemma [2] with Q = p — 2 and a = . 

To see this, we note that the third derivative of Pp^t') is nonzero only when t G [—j 1 ~ and 
when t G [1,2]. On the hrst interval, we have 

= (^t + = {p-2)(^t + . 

On the second interval, we have 

V"(i) = {2 - , <p'pt) = -{p - 2) (2 - tf-^. 


For both cases we have the inequality 

which means Q = p — 2 and a = Therefore, according to Lemma [2) the regularized empirical 
loss l{vS) is self-concordant with parameter 

'-y 2 p —2 



and the scaled loss function j{w) = {M^ /A)l{w) is standard self-concordant. 


3 Inexact damped Newton method 

In this section, we propose and analyze an inexact damped Newton method for minimizing self- 
concordant functions. Without loss of generality, we assume the objective function / : —>■ M 

is standard self-concordant. In addition, we assume that Assumption |A] holds. Our method is 
described in Algorithm [TJ If we let = 0 for all A: > 0, then Vk = {f''{wk)\~^f'{wk) is the exact 
Newton step and 5k is the Newton decrement defined in ([5]), so the algorithm reduces to the exact 
damped Newton method given in dH). But here we allow the computation of the Newton step 
(hence also the Newton decrement) to be inexact and contain approximation errors. 
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The explicit account of approximation errors is essential for distributed optimization. In par¬ 
ticular, if /(rc) = ( 1 /m) Ym=i the components fi locate on separate machines, then we 

can only perform Newton updates approximately with limited communication budget. Even in a 
centralized setting on a single machine, analysis of approximation errors can be important if the 
Newton system is solved by iterative algorithms such as the conjugate gradient method. 

Before presenting the convergence analysis, we need to introduce two auxiliary functions 

oj{t) = t — log(l + t), t > 0 , 

= —t — log(l —t), 0 <t < 1. 

These two functions are very useful for characterizing the properties of self-concordant functions; 
see [331 Section 4.1.4] for a detailed account. Here, we simply note that a;(0) = u;*(0) = 0, both are 
strictly increasing for t > 0, and w* (t) —)• oo as t —)• 1 . 

We also need to define two auxiliary vectors 

Uk = [f''{Wk)]~^^‘^f{Wk), 

Vk = [f"{Wk)]^^‘^Vk. 

The norm of the first vector, ||ufc ||2 = f'{wkY[f''{wk)]~^ f'{wk), is the exact Newton decrement. 
The norm of the second one is ||ufc ||2 = 5k, which is computed during each iteration of Algorithm [H 
Note that we do not compute Uk or Vk in Algorithm [T1 They are introduced solely for the purpose 
of convergence analysis. The following Theorem is proved in Appendix O 

Theorem 1. Suppose f : M is a standard self-concordant function and Assumption\^ holds. 

If we choose the sequence {efc}fc>o in Algorithmic as 

ek = (3{X/L)^^‘^\\f'{wk)\\2 with P = 1/20, (10) 


then: 

(a) For any k>0, we have f{wk+i) < f{wk) - \uj{\\ukh)- 
(h) If \\ukh < 1/6, then we have a;(||ufc+i|| 2 ) < ^io{\\uk\\ 2 )- 

As mentioned before, when = 0, the vector Vk = [f"i'Wk)]~^f'{wk) becomes the exact Newton 
step. In this case, we have Vk = Uk, and it can be shown that f{wk+i) < f{wk) — w(||ufc|| 2 ) for all 
k>0 and the exact damped Newton method has quadratic convergence when ||Mfc ||2 is small (see 
[34l Section 4.1.5]). With the approximation error Ck specified in (fT0]l . we have 

\\vk - Ukh < \\{f'{Wk))~^^^\\2\\f"{Wk)vk - f{wk)\\2 < 

= /3L-^/^f{wk)\\2</3\\hk\\2, 


which implies 


(1 - /3)\\uk\\2 < ||^'fc||2 < (1 + /3)\\uk\\2- 


( 11 ) 


Appendix shows that when fi is sufficiently small, the above inequality leads to the conclusion 
in part (a). Compared with the exact damped Newton method, the guaranteed reduction of the 
objective value per iteration is cut by half. 
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Part (b) of Theorem [T] suggests a linear rate of convergence when Hufclb is small. This is slower 
than the quadratic convergence rate of the exact damped Newton method, due to the allowed 
approximation errors in computing the Newton step. Actually, superlinear convergence can be 
established if we set the tolerances to be small enough; see Appendix [B] for detailed analysis. 
However, when is computed through a distributed iterative algorithm (like the distributed PCG 
algorithm in Section l4.2p . a smaller would require more local computational effort and more 
rounds of inter-machine communication. The choice in equation (|10l) is a reasonable trade-off in 
practice. 

Using Theorem [H we can derive the iteration complexity of Algorithm [1] for obtaining an 
arbitrary accuracy. We present this result as a corollary. 

Corollary 1. Suppose f : W is a standard self-concordant function and Assumption H] 

holds. If we choose the sequence {cfc} in Algorithm [7] as in (iinD, then for any e > 0, we have 
fi'Wk) — fi'w*) < e whenever k > K where 


K = 


fjwo) - f{w* 
ia;(l/ 6 ) 


+ 



V e / 


( 12 ) 


Here [t] denotes the smallest nonnegative integer that is larger or equal to t. 


Proof. Since uj{t) is strictly increasing for t >0, part (a) of Theorem [T] implies that if ||nfc ||2 > 1/6, 
one step of Algorithm [D decreases the value of f{w) by at least a constant So within at 

most Ki = iterations, we are guaranteed that || 5 fc ||2 < 1 / 6 . 

According to [Ml Theorem 4.1.13], if ||ufe ||2 < 1, then we have 


OjiWukh) < f{Wk) - /(u>*) < W^djUfclb). 


(13) 


Moreover, it is easy to check that a;*(t) < 2uj{t) for 0 < t < 1/6. Therefore, using part (b) of 
Theorem [H we conclude that when k > Ki, 

f{wk)-fM < 2a;(||Sfc||2) < 2(l/2)"-^ia;(||SxJ|2) < 2(l/2)^-^io;(l/6). 


Bounding the right-hand side of the above inequality by e, we have f{wk) — f{w*) < e whenever 


k>Ki + 


log2 


2aj(l/6) 


We note that when 
for e > 2 w(l/ 6 ), it suffices to have k > Ki. 


= K, which is the desired result. 

2 < 1/6 (as long as k > ATi), we have f{wk) — f{wif) < 2a;(l/6). Thus 

□ 


3.1 Stopping criteria 

We discuss two stopping criteria for Algorithm [TJ The first one is based on the strong convexity 
of /, which leads to the inequality (e.g., [Ml Theorem 2.1.10]) 

f{wk) - f{w^) < ^||/'(w^fc)|l2- 

Therefore, we can use the stopping criterion \\f'{wk )\\2 < \/2Ae, which implies f{wk) — f{Wi,) < e. 
However, this choice can be too conservative in practice (see discussions in [9l Section 9.1.2]). 
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Another choice for the stopping criterion is based on self-concordance. Using the fact that 
for 0 < t < 0.68 (see [HI Section 9.6.3]), we have 

fim) - fiw^) < w*(||Sfc|| 2 ) < \\uk\\l (14) 

provided ||tife ||2 < 0.68. Since we do not compute ||ttfc ||2 (the exact Newton decrement) directly 
in Algorithm (TJ we can use 6^ as an approximation. Using the inequality (|lll) . and noticing that 
\\vk \\2 = <5fc, we conclude that 

4 < (1 - /3)Ve 

implies f{wk) — fiw*) < e when e < 0.68^. Since 5^ is computed at each iteration of Algorithm (U 
this can serve as a good stopping criterion. 


3.2 Scaling for non-standard self-concordant functions 


In many applications, we need to deal with empirical loss functions that are not standard self- 
concordant; see the examples in Section [ 2 j Suppose a regularized loss function i{w) is self- 
concordant with parameter > 2. By Lemma [H the scaled function f = rji with rj = M|/4 
is standard self-concordant. We can apply Algorithm [ 1 ] to minimize the scaled function /, and 
rewrite it in terms of the function i and the scaling constant rj. 

Using the sequence {tk} defined in (fTOt) . the condition for computing in Step 1 is 

\\f"{Wk)Vk - f'{Wk)\\2 < P{X/{Wk)\\2- 

Let \(, and be the strong convexity and smoothness parameters of the function 1. With the 
scaling, we have A = r]\i and L = t/L^, thus their ratio (the condition number) does not change. 
Therefore the above condition is equivalent to 

\r{Wk)Vk-nWk)\\2 < /3(A,/L,)V2||/(^,)||2. (15) 


In other words, the precision requirement in Step I is scaling invariant. 
Step 2 of Algorithm [T] can be rewritten as 

Vk 

Wk+l =Wk - ; 

I + V^- \/vl^"{wk)vk 


(16) 


Here, the factor rj explicitly appears in the formula. By choosing a larger scaling factor rj, the 
algorithm chooses a smaller stepsize. This adjustment is intuitive because the convergence of 
Newton-type method relies on local smoothness conditions. By multiplying a large constant to i, 
the function’s Hessian becomes less smooth, so that the stepsize should shrink. 

In terms of complexity analysis, if we target to obtain i{wk) — I'(tc*) < e, then the iteration 
bound in (fT^ becomes 

r]{e{wo) -£{w^)) 

Ia;(I/ 6 ) 





\ rje J 


(17) 


For ERM problems in supervised learning, the self-concordant parameter M^, and hence the scaling 
factor r] = M'jjA, can grow with the number of samples. For example, the regularization parame¬ 
ter 7 in ([7|) often scales as 1/yfN where N = mn is the total number of samples. Lemma [2] suggests 
that 7 grows on the order of y/mn. A larger 7 will render the second term in (jI7|] less relevant, but 
the first term grows with the sample size mn. In order to counter the effect of the growing scaling 
factor, we need to choose the initial point wq judiciously to guarantee a small initial gap. This will 
be explained further in the next sections. 
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4 The Disco algorithm 


In this section, we adapt the inexact damped Newton method (Algorithm[T|) to a distributed system, 
in order to minimize 

^ m 

f{w) =—^fi{w), (18) 

2 = 1 

where each function /j can only be evaluated locally at machine i (see background in Section [1]) . 
This involves two questions: (1) how to set the initial point wq and (2) how to compute the inexact 
Newton step Vk in a distributed manner. After answering these two questions, we will present the 
overall DiSCO algorithm and analyze its communication complexity. 


4.1 Initialization 


In accordance with the averaging structure in (|18p . we choose the initial point based on averaging. 
More specifically, we let 


ni 

Wo = — 

m 

2 = 1 


(19) 


where each Wi is the solution to a local optimization problem at machine v. 


Wi = arg min 

loSK'* 




z = 1 ,..., m. 


( 20 ) 


Here p > 0 is a regularization parameter, which we will discuss in detail in the context of stochastic 
analysis in Section [5l Roughly speaking, if each /j is constructed with n i.i.d. samples as in ([3]), 
then we can choose p ~ ^l\pn to make E,[f{wo) — /(rc*)] decreasing as 0{l/y/n). In this section, 
we simply regard it as an input parameter. 

Here we comment on the computational cost of solving (I20p locally at each machine. Suppose 
each fi{w) has the form in ([3|), then the local optimization problems in ()20p become 


Wj = arg min 
weRd 



i = 1 ,..., m. 


( 21 ) 


The finite average structure of the above objective function can be effectively exploited by the 
stochastic average gradient (SAG) method [401141j or its new variant SAGA [TTj. Each step of these 
methods processes only one component function (p{w,Zij), picked uniformly at random. Suppose 
fi{w) is L-smooth, then SAG returns an e-optimal solution with 0{{n+ ^^)log(l/e)) steps of 
stochastic updates. For ERM of linear predictors, we can also use the stochastic dual coordinate 
ascent (SDGA) method [H], which has the same complexity. We also mention some recent progress 
in accelerated stochastic coordinate gradient methods [431 [271 [53], which can be more efficient both 
in theory and practice. 


4.2 Distributed computing of the inexact Newton step 

In each iteration of Algorithm [H we need to compute an inexact Newton step Vk such that 
\\f''{wk)vk — f'iwk )\\2 < Cfc- This boils down to solving the Newton system f"{wk)vk = f'{wk) 
approximately. When the objective / has the averaging form (I18h . its Hessian and gradient are 
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Algorithm 2: Distributed PCG algorithm (given and jj., compute Vk and 6k) 
master machine (z = 1 ) machines i = 1,... ,m 

input: Wk G and /U > 0. 

let H = f"{wk) and P = fi{wk) + lal. 

communication: 

broadcasts rcfc to other machines; -)• compute 

aggregate /-(wk) to form f'{wk). i - 

initialization: compute Ck given in (jlOp and set 
ti(o) = 0, s^^^ = 

7-(o) = f'{wk), n® = 

repeat for t = 0 , 1 , 2 , 

1. communication: 

broadcast and - > compute f"{wk)u^^^ 

aggregate to form and Hv^^\ < - compute f”{wk)v^^^ 

2 . compute at = update 

= y{i) -\- atU^^\ 

j,(t+l) _ j,{t) _ ^ 


3. compute Pt = update 

g(t+i) ^ p-i^{t+i)^ 

t(d+i) = sh+b + /3trtW. 
until ||rd+i) II 2 < Ck 

return Vk = v Vk = and 6k = 


given in ([ 6 |). In the setting of distributed optimization, we propose to use a preconditioned conjugate 
gradient (PCG) method to solve the Newton system. 

To simplify notation, we use H to represent f"{wk) and use Hi to represent f"{wk)- Without 
loss of generality, we define a preconditioning matrix using the local Hessian at the first machine 
(the master node): 

^ def .tt- t 

P = Hi + nl, 

where /U > 0 is a small regularization parameter. Algorithm [2] describes our distributed PCG 
method for solving the preconditioned linear system 

p-^Hvk = P-^f'{wk). 


In particular, the master machine carries out the main steps of the classical PCG algorithm (e.g., 
|20l Section 10.3]), and all machines (including the master) compute the local gradients and Hessians 
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and perform matrix-vector multiplications. Communication between the master and other machines 
are used to form the overall gradient f'{wk) and the matrix-vector products 


^ m 

= — '^f”{wk)u 
m 


it) 


2=1 


m 


2=1 


We note that the overall Hessian H = f"{wk) is never formed and the master machine only stores 
and updates the vectors and 

As explained in Section 11.21 the motivation for preconditioning is that when Hi is sufficiently 
close to H, the condition number of P~^H might be close to 1, which is much smaller than that of H 
itself. As a result, the PCG method may converge much faster than CG without preconditioning. 
The following lemma characterizes the extreme eigenvalues of P~^H based on the closeness between 
Hi and H. 

Lemma 3. Suppose Assumptionl^ holds. If \\Hi — H \\2 < /r, then we have 

< 1 , ( 22 ) 

Here || • ||2 denote the speetral norm of a matrix, and crmax(‘) o,nd crmin(') denote the largest and 
smallest eigenvalues of a diagonalizable matrix, respeetively. 

Proof. Since both P and H are symmetric and positive definite, all eigenvalues of P~^H are 
positive real numbers (e.g., mi Section 7.6]). The eigenvalues of P ^H are identical to that of 
P~^PHP~^P. Thus, it suffices to prove inequalities (|22]1 and (|23]) for the matrix P~^PhP~^P. To 
prove inequality ([2^ . we need to show that H < P = Hi + pbl. This is equivalent to H — Hi P /rl, 
which is a direct consequence of the assumption \\Hi — H \\2 < p.1. 

Similarly, the second inequality ([231) is equivalent to H P a+ 2 /a is the same as 

^H — fj.1 P Hi — H. Since H P XI (by Assumption lA)) . we have ^H — pLl P p,I. The additional 
assumption \\Hi — H \\2 < p.1 implies /rJ P Hi — H, which complete the proof. □ 


By Assumption lAl the condition number of the Hessian matrix is k{H) = L/X, which can be 
very large if A is small. Lemma [3] establishes that the condition number of the preconditioned linear 
system is 


KiP-^H) 


CTmax(P~^P) 

arain{P-^H) 



(24) 


provided that jjPi — 77||2 < p-. When p is small (comparable with A), the condition number 
k{P~^H) is close to one and can be much smaller than k{H). Based on classical convergence 
analysis of the CG method (e.g., miE]), the following lemma shows that Algorithm [2] terminates 
in 0{i/l + p/X) iterations. See Appendix [C] for the proof. 


Lemma 4. Suppose Assumption\^ holds and assume that \\Hi — H \\2 < p. Let 


Tu = 


'^/^^l^g(^ 2/L7A||/(u;fc)||2 


Then Algorithmic terminates in iterations and the output satisfies \\Hvk — f'{wk )\\2 < Cfc- 
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Algorithm 3: DiSCO 

input: parameters p,/r > 0 and precision e > 0. 
initialize: compute wq according to (fT^ and (1201) . 
repeat for A; = 0,1, 2,... 

1. Run Algorithm O given Wk and /r, compute Vk and 6k- 

2. Update Wk+i = Wk - T^^k- 

until 4 < (1 - l3)Ve. 

output: w = Wk+i- 


When the tolerance is chosen as in (fTOj) . the iteration bound is independent of f'{wk)-, 


i.e., 



(25) 


Under Assumption |Al we always have ||-fAi — H\\2 < L. If we choose // = L, then Lemma [4] implies 
that Algorithm [2] terminates in 0 {-\/L/X) iterations, where the notation O(-) hides logarithmic 
factors. In practice, however, the matrix norm \\Hi — H\\2 is usually much smaller than L due to the 
stochastic nature of /j. Thus, we can choose ^ to be a tight upper bound on \\Hi — H\\2, and expect 
the algorithm terminating in iterations. In Section [H we show that if the local empirical 

losses fi are constructed with n i.i.d. samples from the same distribution, then \\Hi — H\\2 ~ ^ly/n 
with high probability. As a consequence, the iteration complexity of Algorithm [2] is upper bounded 
by 0(1 + 


We wrap up this section by discussing the computation and communication complexities of 
Algorithm O The bulk of computation is at the master machine, especially computing the vector 
s(d = in Step 3, which is equivalent to minimize the quadratic function (l/2)s^Ps — 

Using P = fi{wk) + fJ-I and the form of fi{w) in ([3]), this is equivalent to 


s 


(i) 


= arg min 
sSR'* 



s'^(j)"{Wk,Zij)s 

2 





(26) 


This problem has the same structure as m, and an e-optimal solution can be obtained with 
0[{n + y^) log(l/e)) stochastic-gradient type of steps (see discussions at the end of Section ITT]) . 

As for the communication complexity, we need one round of communication at the beginning 
of Algorithm [2] to compute f'{wk)- Then, each iteration takes one round of communication to 
compute and Thus, the total rounds of communication is bounded by -|- 1. 


4.3 Communication efficiency of DiSCO 

Putting everything together, we present the DiSCO algorithm in Algorithm [3l Here we study 
its communication efficiency. Recall that by one round of communication, the master machine 
broadcasts a message of 0{d) bits to all machines, and every machine processes the aggregated 
message and sends a message of 0{d) bits back to the master. The following proposition gives an 
upper bound on the number of communication rounds taken by the DiSCO algorithm. 


16 














Algorithm 4: Adaptive DiSCO 
input: parameters p > 0 and po > 0, and precision e > 0. 
initialize: compute tco according to (fT9l) and ([20]l . 
repeat for A: = 0,1, 2,... 

1. Run Algorithm [2] up to PCG iterations, with output Vk, 6k, Vk and Cfc. 

2. if ||rfc ||2 > Cfc then 

set /ifc := 2pfc and go to Step 1; 

else 

set Pfc+i := and go to Step 3. 

3. Update Wk+i = Wk - j^Vk- 

until 4 < (1 - /3)^/e. 

output: w = Wk+i- 


Theorem 2. Assume that f is a standard self-concordant function and it satisfies AssumvtionlAi 
Suppose the input parameter p in Algorithmic is an upper bound on \\fi{wk) — f''{wk )\\2 for all 
k > 0. Then for any e > 0, in order to find a solution w satisfying f{w) — f{Wi,) < e, the total 
number of communication rounds T is bounded by 


T <1 + 


fjwp) - fjw^) 
iw(l/6) 


+ 


log2 


2 w(l/6) 



(27) 


Ignoring logarithmic terms and universal constants, the rounds of communication T is bounded by 

^ - f{Wi,) + log(l/e)) Vl + 2/i/A^ . 

Proof. First we notice that the number of communication rounds in each call of Algorithm [2] is 
no more than 1 + T^, where is given in (1251) . and the extra 1 accounts for the communication 
round to form f'{wk). Corollary [T] states that in order to guarantee f{wk) — f{w*) < e, the total 
number of calls of Algorithm [2] in DiSCO is bounded by K given in (11211 . Thus the total number of 
communication rounds is bounded by 1 + Ar(l + T^), where the extra one count is for computing 
the initial point wp defined in (|19p . □ 

It can be hard to give a good a priori estimate of p that satishes the condition in Theorem [2j 
In practice, we can adjust the value of // adaptively while running the algorithm. Inspired by a line 
search procedure studied in [35], we propose an adaptive DiSCO method, described in Algorithm [H 
The following proposition bounds the rounds of communication required by this algorithm. 

Theorem 3. Assume that f is a standard self-concordant function and it satisfies Assumption C4l 
het Pmax be the largest value of fik generated by Algorithm 0 i.e., Pmax = max{po,Pi, • • • 
where K is the number of outer iterations. Then for any e > 0, in order to find a solution w 
satisfying f{w) — /(tc*) < e, the total number of communication rounds T is bounded by 


T < 1+ 2 


7(^^o) - fjWif) 
a;(l/6) 


+ 2 


logT^) 


+ log2 


Pmax 

PO 
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Proof. Let be the number of calls to Algorithm [2] during the kth iteration of Algorithm [H We 
have 

Aifc+i = = ^fc2"'=-2, 

which implies 

o I 1 k'k+l 
nfc = 2 + log2-. 

IJ-k 

The total number of calls to Algorithm [2] is 
K-l K-l 

Nr = 

k=0 k=0 

Since each call of Algorithm [2] involves no more than + 1 communication rounds, we have 

r<i + iVx(r^_ + i). 

Plugging in the expression of K in (fT^ and in (1^5]) . we obtain the desired result. □ 

From the above proof, we see that the average number of calls to Algorithm [2] at each iteration 
is 2 + log2 ) roughly twice as the non-adaptive Algorithm[3l Ignoring logarithmic terms and 
universal constants, the number of communication round T used by Algorithm H] is bounded by 

O {{f{wo) - f{w^) +log2(l/e))\/l + 2/i max/-^^ • 

In general, we can update fik in Algorithm [4] as follows: 

_ f OincfJ-k if|kfc||2>efc, 

' 1 k-k/Odec if Ikfclb < efc, 

with any Oi^c > 1 and 6dec ^ 1 (see [35]). We have used 9inc = ^dec = 2 to simplify presentation. 

4.4 A simple variant without PCG iterations 

We consider a simple variant of DiSCO where the approximate Newton step is computed without 
using the PCG method described in Algorithm |2j Instead, we simply set 

Vk = P~^f'{wk) = ifiiwk) + nl)~^fiwk), (28) 

which is equivalent to setting Vk = in the initialization phase of Algorithm [2l or forcing it to 
always exit during the first PCG iteration. (The latter choice gives the same search direction but 
with a slightly different scaling.) In this variant, each iteration of the inexact damped Newton 
method requires two communication rounds: one to form f'{wk) and another to compute the 
stepsize parameter 5k = (v'^f"iwk)vk)^^‘^■ 

A distributed algorithm that is similar to this variant of DiSCO is proposed in m- It does not 
compute 6k', instead it uses line search to determine the step size, which also requires extra round(s) 
of communication. It is shown in m that this method works well in experiments, requiring less 
number of iterations to converge than ADMM. However, according to their theoretical analysis, its 
iteration complexity still depends badly on the condition number. 
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Here we examine the theoretical conditions under which this variant of DiSCO enjoys a low 
iteration complexity. Recall the two auxiliary vectors defined in Section [3l 


Uk = H Vk 

The norm of their difference can be bounded as 

Wvk-Ukh = \\H^^^P~^f'{wk)-Uk\ 


= 


Vk- 




‘^k ^fc||2 


= \\I-P-^H 


2- 


From Lemma O we know that when ||Lfi — Lr||2 < /r, the eigenvalues of P are located within 
the interval [y:^, !]• Therefore, we have 


- Ukh < 1 - 


A 


A + 2/i 


2 — 


2/i 


A T 2// 


2- 


The above inequality implies 


1 - 


2fi 


X + 2i2 


2 < 


2 ^ 


< 1 + 


2/i 


A + 2fjL 


This inequality has the same form as dm), which is responsible to obtain the desired low complexity 
result if is sufficiently small. Indeed, if < /3 = ^ as specified in (fTO]) . the same convergence 
rate and complexity result stated in Theorem [1] and Corollary [1] apply. Since each iteration of 
the damped Newton method involves only two communication rounds (to compute f'{wk) and 5k 
respectively), we have the following corollary. 

Corollary 2. Assume that f is a standard self-concordant function and it satisfies Assumption CH 
In the DiSCO algorithm, we compute the inexact Newton step using (f28]l . Suppose ^ and 

Wfiiwk) — f''{wk )\\2 < h for all k > 0. Then for any e > 0, in order to find a solution w satisfying 
f{w) — /(rc*) < €, the total number of communication rounds T is bounded by 


T <1 + 2 


fjwo) - f{w*) 
ia;(l/6) 


+ 


log2 


2 a;(l/6) 


(29) 


In Corollary[2l the requirement on /i, which upper bounds \\fiiwk) — f"iwk )\\2 for all fc > 0, is 
quite strong. In particular, it requires /r to be a small fraction of A in order to satisfy 
we will see from the stochastic analysis in the next section, the spectral bound fj, decreases on the 
order of Ify/n. Therefore, in the standard setting where the regularization parameter A ~ Ify/mn, 
the condition in Corollary [2] cannot be satisfied, and the convergence of this simple variant may be 
slow. In contrast, DiSCO with PCG iterations is much more tolerant of a relatively large p, and 
can achieve superlinear convergence with a smaller e^. 


5 Stochastic analysis 

From Theorems [2] and [3] of the previous section, we see that the communication complexity of 
the DiSCO algorithm mainly depends on two quantities: the initial objective gap f{wo) — f{w-,f) 
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and the upper bound fj. on the spectral norms {{/"{wk) — f"iwk )\\2 for all k > 0. As we discussed 
in Section 13.21 the initial gap f{wo) — f{w-^) may grow with the number of samples due to the 
scaling used to make the objective function standard self-concordant. On the other hand, the 
upper bound /r may decrease as the number of samples increases based on the intuition that the 
local Hessians and the global Hessian become similar to each other. In this section, we show how to 
exploit the stochastic origin of the problem (SAA or ERM, as explained in Section [T]) to mitigate the 
effect of objective scaling and quantify the notion of similarity between local and global Hessians. 
These lead to improved complexity results. 

We focus on the setting of distributed optimization of regularized empirical loss. That is, our 
goal is to minimize /(rc) = (1/m) where 


fiiw) 


, A„ „2 

- 2_^4){w,Zij) + -\\w\\ 2 , 

^ i=i 


i = 1,... ,m. 


(30) 


We assume that Zij are i.i.d. samples from a common distribution. Our theoretical analysis relies 
on refined assumptions on the smoothness of the loss function cj). In particular, we assume that for 
any z in the sampling space Z, the function has bounded first derivative in a compact set, 

and its second derivatives are bounded and Lipschitz continuous. We formalize these statements 
in the following assumption. 

Assumption B. There are finite constants (Vq, G, L, M), such that for any z G Z: 

(i) z) > 0 for all w G and (/)(0, z) < Vq; 

(a) \\(fi{w,z )\\2 < G for any ||u;||2 < 

(in) \\<f"{w,z )\\2 < L — X for any w G 

(iv) \\(f"{u, z) — z )\\2 < M\\u — ui||2 for any u,w G 

For the regularized empirical loss in (I30p . condition (Hi) in the above assumption implies XI ^ 
fi{w) ^ LI for i = 1,..., m, which in turn implies Assumption [Al 

Recall that the initial point wq is obtained as the average of the solutions to m regularized local 
optimization problems; see equations ()19ll and (12011 . The following lemma shows that expected value 
of the initial gap f{wo) — fiwfij decreases with order Ij^/n as the local sample size n increases. 
The proof uses the notion and techniques of uniform stability for analyzing the generalization 
performance of ERM [7]. See Appendix [D] for the proof. 

Lemma 5. Suppose that Assumption[B\ holds and E[||r(;*|| 2 ] < for some constant D > 0. If we 
choose p = in ([20]) to compute Wi, then the initial point wq = ^ satisfies 


and 


max{||w*||2, ll-wolb} < 

(31) 

mim) /(w'*)] < ^ ■ 

V ^ 

(32) 


Here the expectation is taken with respect to the randomness in generating the i.i.d. data. 
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Next, we show that with high probability, \\f”{w) — f''{w )\\2 ~ ydjn for any i G {!,... ,m} 
and for any vector w in an ^2-ball. Thus, if the number of samples n is large, the Hessian matrix of / 
can be approximated well by that of /*. The proof uses random matrix concentration theories m- 
We defer the proof to Appendix lEl 

Lemma 6. Suppose Assumvtion holds. For any r > 0 and any i G m}, we have with 

probability at least 1 — 5, 


where 


sup Wfiiw) - f"iw)\\2 < fMr,5, 

w\\2^r 


def 



log (l + 



+ 


log{md/5) 

d 


(33) 


If (p{w,Zij) are quadratic functions in w, then we have M = 0 in Assumption [Bl In this case. 
Lemma [6] implies \\fi{w) — f''{w )\\2 ~ yl/n. For general non-quadratic loss. Lemma [6] implies 
\\f"{w) — f"{w )\\2 ~ sfdfn. We use this lemma to obtain an upper bound on the spectral norm of 
the Hessian distances \\fi{wk) — f''iwk)\\ 2 , where the vectors Wk are generated by Algorithm [TJ 


Corollary 3. Suppose AssumvtionWl holds and the sequenee {wk}k>o generated by Algorithmic 

1/2 

. Then with probability at least 1 — 5, we have for all k > 0, 


Letr= + 


\\fi{wk) - f'{wk)h < uiin < 


L, 


32L2(i 


n 


. , rMy/^\ log(md/5) 

‘“d 1 + — 


(34) 


Proof. We begin by upper bounding the ^2-norm of Wk, for k = 0,1,2 generated by Algorithm [T] 
By Theorem [H we have f{wk) < f{u)o) for all k > 0. By Assumption iBl (i), we have 4>{w,z) > 0 
for all lu G and z £ Z. As a consequence, 

< f{wk) < fiwo) < /(O) G\\wo \\2 < Ho -b G'Hiuolb- 


Substituting ||rco||2 < ■\/‘2Vo/\ (see Lemma [5]) into the above inequality yields 

1/2 


2 < 


2Ho 2G [W^\ 

—+-V—) 


= r. 


Thus, we have ||iua ;||2 < r for all k>0. Applying Lemma [6] establishes the corollary. □ 


Here we remark that the dependence on d of the upper bound in (I34p comes from Lemma [H 
where the bound needs to hold for all point in a d-dimensional ball with radius r. However, for 
the analysis of the DiSCO algorithm, we only need the matrix concentration bound to hold for a 
finite number of vectors wo,wi,... ,wk, instead of for all vectors satisfying ||iu||2 < r. Thus we 
conjecture that the bound in especially its dependence on the dimension d, is too conservative 
and very likely can be tightened. 

We are now ready to present the main results of our stochastic analysis. The following theo¬ 
rem provides an upper bound on the expected number of communication rounds required by the 
DiSCO algorithm to find an e-optimal solution. Here the expectation is taken with respect to the 
randomness in generating the i.i.d. data set {zij}. 
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Theorem 4. Let Assumption O hold. Assume that the regularized empirical loss function f is 
standard self-concordant, and its minimizer = argmin^/(ir) satisfies E[||i(;*|||] < for some 
constant D > 0. Let the input parameters to Algorithmic be p = and p, = ^ (|33l) with 


(2Va 2G 

iir + xvirj 


GD v^WZ) 

AVo + 2G^/\' 


(35) 


Then for any e > 0, the total number of communication rounds T required to reach /({<})— /(ir*) < e 
is bounded by 


E[r] < 1 + { Cl + I 2 + C2 I 1 + 2 


a;(l/6) ^/n J 

where Gi,G 2 ,Cs are 0(1} or logarithmic terms: 


32L‘^d Cs 


1/2n 
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log2 
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1 GD \ 

Z ’ 4Vb + 2C2/A ) 


Ch = log (1^ 

03 = log (l + + log(<im/«) 


c 7 




In particular, ignoring numerical constants and logarithmic terms, we have 


E[T] = O 


log(l/e) + 


GD\ 


/ LV2dl/4\\ 

V ^ AV2ni/4 ) j ■ 


Proof. Suppose Algorithm [3] terminates in K iterations, and let tk be the number of conjugate 
gradient steps in each call of Algorithm [21 for k = 0,1,..., K — 1. For any given p > 0, we define 
as in ([25]l . Let A denotes the event that tk < for all A: E {0,... ,K — 1}. Let A be the 
complement of A, i.e., the event that tk > for some k E {0 ,... ,K — 1}. In addition, let the 
probabilities of the events A and Al be 1 — <5 and 6 respectively. By the law of total expectation, 
we have 


E[T] = E[T|^]P(^) +E[r|^]P(^) = (1 - 5)E[T|^] + (5E[T|^]. 

When the event A happens, we have T < 1 + K(T^ + 1) where is given in (I25|) : otherwise we 
have T < 1 + K{Tl + 1), where 

[ffj (36) 

is an upper bound on the number of PCG iterations in Algorithm [2] when the event A happens (see 
the analysis in Appendix lF[l . Since Algorithm [2] always return a Vk such that \\f"(wk)vk — f'(wk )\\2 < 
€k, the outer iteration count K share the same bound (1121) . which depends on the random variable 
f{wo) — fiw*)- However, and are deterministic constants. So we have 

E[r] < 1 + (1 - 6)E[K{Tf, + 1)|.A] + 61K[K(Tl + 1)|.A] 

= 1 + (1 - 5)iTf, + 1)E[A:|^] + 5(Tl + 1)E[A:|^]. (37) 
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Next we bound E[it'|^] and separately. To bound E[iir|^], we use 

E[K] = (1 - 6)E[K\A] + 6E[K\A] > (1 - 6)E[K\A] 

to obtain 

E[K\A] < E[K]/{1 - 6). (38) 

In order to bound E[it'|^], we derive a deterministic bound on f{wo) — By Lemma O we 

have lircolb < -\/2Vo/A, which together with Assumption [B] (ii) yields 

||/'(^(^)ll2 <G + A||u;||2 <G+ 

Combining with the strong convexity of /, we obtain 

f{wo) - f{w^) < ^\\f'{wo)\\l < ^ (gt \/2 AFo) 

Therefore by Corollary [H 
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< 21 / + - 




log2 




(39) 


w(l/6) 

where the additional 1 counts compensate for removing one [•] operator in (I12p . 

Using inequality (i37]l . the bound on E[iL|Al] in (l38]l and the bound on E[Ar|^] in ([39]), we obtain 


E[T] < 1 + (T^ + l)E[iL] + 6(Tl + l)iCmax. 
Now we can bound E[iL] by Corollary [1] and Lemma [5j More specifically, 


m] < + 
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where Cq = 1 + [log2(2w(l/6)/e)]. With the choice of 6 in (1351) and the definition of Tl in (1361) . 
we have 


6{Tl + l)iLmax = 
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Putting everything together, we have 
E[r] < 1 + ^Go + -% 

< 1 + (^GiT 


GD 
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2^6 + 1 Gd\ 


n 4Uo + 2G2/A w(l/6) 
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y A + 1). 


Replacing by its expression in (|25|) and applying Corollary [3l we obtain the desired result. □ 
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According to Theorem 01 we need to set the two input parameters p and p in Algorithm [3] 
appropriately to obtain the desired communication efficiency. Using the adaptive DiSCO method 
given in Algorithm 01 we can avoid the explicit specification of p = pr^s defined in ([33]l and (|3^ . 
This is formalized in the following theorem. 

Theorem 5. Let Assumption O hold. Assume that the regularized empirical loss function f is 
standard self-concordant, and its minimizer tc* = argmin^„/(re) satisfies E[||r(;*|||] < for some 
constant D > 0. Let the input parameters to Algorithm\^ be p = and any po > 0. Then the 
total number of communication rounds T required to reach f{w) — f{w,,) < e is bounded by 


E[r] = o 


log(l/e) + 


GD\ 


f LV2dV4\\ 

[} ^ AV2ni/4 ) J ■ 


Proof. In Algorithm 01 the parameter pk is automatically tuned such that the number of PCG 
iterations in Algorithm [2l is no more than . By Corollary [3l with probability at least 1 — d, we 
have 

max{/xo, ...,Pk}< ‘2^Pr,6 

where pr^s is defined in ([33]) , and r and 6 are given in (I35p . Therefore we can use the same arguments 
in the proof of Theorem 0] to show that 


E[r] < 1 + + 


a;(l/6) ^/n J 


where 


Cl = (2 + 2 


( 2 + ^2 ( 1 + 4 


6ZL,^a Us 


log. ( 


+ log2 ( — 
Mo 


1 + ^ 


1 CD \ 

^ ' 4Vo + 2G2/aJ 


and 6*2 and C 3 are the same as given in Theorem 01 Ignoring constants and logarithmic terms, we 
obtain the desired result. □ 

In both Theorems 0] and El the parameter p = depends on a constant D such that 

E[||r(;*|||] < D^. In practice, it may be hard to give a tight estimate of E[||'u;*|| 2 ]. An alterna¬ 
tive is to fix a desired value of D and consider the constrained optimization problem 


minimize f{w). 

\\w\\ 2 '^D 


To handle the constraint ||rc ||2 < D, we need to replace the inexact damped Newton method in 
DiSCO with an inexact proximal Newton method, and replace the distributed PCG method for 
solving the Newton system with a preconditioned accelerated proximal gradient method. Further 
details of such an extension are given in Section [71 

Remarks The expectation bounds on the rounds of communication given in Theorems 0] and E] 
are obtained by combining two consequences of averaging over a large number of i.i.d. local sam¬ 
ples. One is the expected reduction of the initial gap f{wo) — f{Wif) (Lemma El), which helps 
to mitigate the effect of objective scaling used to make / standard self-concordant. The other is 
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a high-probability bound that characterizes the similarity between the local and global Hessians 
(Corollary [3]). If the empirical loss / is standard self-concordant without scaling, then we can regard 
fi'Wo) — fiw*) as a constant, and only need to use Corollary [3] to obtain a high-probability bound. 
This is demonstrated for the case of linear regression in Section 15.11 

For applications where the loss function needs to be rescaled to be standard self-concordant, 
the convexity parameter A as well as the “constants” {Vo,G, L, M) in Assumption |B] also need to 
be rescaled. If the scaling factor grows with n, then we need to rely on Lemma [5] to balance the 
effects of scaling. As a result, we only obtain bounds on the expected number of communication 
rounds. These are demonstrated in Section [5.21 for binary classification with logistic regression and 
a smoothed hinge loss. 


5.1 Application to linear regression 

We consider linear regression with quadratic regularization (ridge regression). More specifically, we 
minimize the overall empirical loss function 

/H = + (41) 

i=i j=i 

where the i.i.d. instances are sampled from Xxy. We assume that A C and T C M are 

bounded; there exist constants Bx and By such that ||x ||2 < Bx and \y\ < By for any (x, y) G A x T- 
It can be shown that the least-squares loss 4>{w, {x,y)) = {y — w'^x)'^ satisfies Assumption iBl with 

Vq = By, G = 2Bx ^By + BxBy^2 /, L = A -|- M = 0. 

Thus we can apply Theorems 0] and [S] to obtain an expectation bound on the number of communi¬ 
cation rounds for DiSCO. For linear regression, however, we can obtain a stronger result. 

Since / is a quadratic function, it is self-concordant with parameter 0, and by definition also 
standard self-concordant (with parameter 2). In this case, we do not need to rescale the objective 
function, and can regard the initial gap f{wo) — as a constant. As a consequence, we can 

directly apply Theorem [2] and Corollary [3] to obtain a high probability bound on the communication 
complexity, which is stronger than the expectation bounds in Theorems 0] and O In particular. 
Theorem [2] states that if 

\\fi{wk) - f'{wk)\\ 2 < for all A: = 0,1,2,... , (42) 

then the number of communication rounds T is bounded as 


T <1 + 


fjwo) - f{w*) 

uj{l/6) 
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log2 


2^(l/6) y 




Since there is no scaling, the initial gap /(tco) —/(rc*) can be considered as a constant. For example, 
we can simply pick rco = 0 and have 









By Corollary [3] and the fact that M = 0 for quadratic functions, the condition (1421) holds with 
probability at least 1 — <5 if we choose 


/ 2,2L‘^d llog(md/6) 8L /——-——— 

" = V(43) 

Further using L < A + 2B^, we obtain the following corollary. 

Corollary 4. Suppose we apply DiSCO (Algorithmic to minimize f{w) defined in (HTIl with the 
input parameter // in (@3]), and let T be the total number of communication rounds required to find 
an e-optimal solution. With probability at least 1 — 5, we have 

^ + aV^T^) log(l/e)log(md/5)). (44) 

We note that the same conclusion also holds for the adaptive DiSCO algorithm (Algorithm |4]) , 
where we do not need to specify the input parameter /i based on ()45]) . For the adaptive DiSCO 
algorithm, the bound in (1441) holds for any 5 E (0,1). 

The communication complexity guaranteed by Corollary [4] is strictly better than that of dis¬ 
tributed implementation of the accelerated gradient method and ADMM ( cf. Table [T]) . If we choose 
A = 0(l/y^mn), then Corollary [4] implies 

T = O log(l/e)^ 

with high probability. The DANE algorithm m , under the same setting, converges in 0{m log(l/e)) 
iterations with high probability (and each iteration requires two rounds of communication). Thus 
DiSCO enjoys a better communication efficiency. 

5.2 Application to binary classification 

For binary classification, we consider the following regularized empirical loss function 

^ m n 

^{w) (45) 

i=l j=l 

where Xij E T C M'^, pij E {—1,1}, and : M —>■ M is a convex surrogate function for the binary 
loss. We further assume that the elements of A are bounded, i.e., we have sup 2 ,g;v^ ||x ||2 < B for 
some finite B. 

Under the above assumptions. Lemma [2] gives conditions on y? for i to be self-concordant. As 
we have seen in Section [2l the function i usually needs to be scaled by a large factor to become 
standard self-concordant. Let the scaling factor be p, we can use DiSCO to minimize the scaled 
function f{w) = rii{w). Next we discuss the theoretical implications for logistic regression and the 
smoothed hinge loss constructed in Section [2j These results are summarized in Table [H 

Logistic Regression For logistic regression, we have ip{t) = log(l -|- e“*). In Section [2l we have 
shown that the logistic loss satisfies the condition of Lemma[2]with Q = 1 and a = 0. Consequently, 
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with the factor r] = the rescaled function f{w) = r]i{w) is standard self-concordant. If we 
express / in the standard form 

1 m n ^ 

/H = — + i^WwWl, (46) 

i=l j=l 

then we have </>(rc, {x,y)) = r]ip{yw'^x) and A = 777. It is easy to check that Assumption [Bl holds 
with 


Vo = r]\og{2), G = r]B, L = /A + ■y), M = r]B^/10, 

which all containing the scaling factor rj. Plugging these scaled constants into Theorems S] and O 
we have the following corollary. 

Corollary 5. For logistic regression, the number of communication rounds required by DiSCO to 
find an e-optimal solution is bounded by 


E[T] = O 


e) + 


B^D 


yn 


1/2 


1 + 


Bd^!^ 

71/2^1/4 


In the specific case when 7 = 0(l/y^mn), Corollary [5] implies 

E[r] = O (m^l^d^l^ + mi/^cii/^log(l/e)) 


If we ignore logarithmic terms, then the expected number of communication rounds is independent 
of the sample size n, and only grows slowly with the number of machines m. 


Smoothed Hinge Loss We consider minimizing i{w) in (1451) where the loss function (/? is the 
smoothed hinge loss defined in Q, which depends on a parameter p >3. Using Lemma[2l we have 
shown in Section [2] that i{w) is self-concordant with parameter Mp given in ([2]). As a consequence, 
by choosing 

Mf {p - 2 ) 25 ^+!^ 

the function f{w) = r]i{w) is standard self-concordant. If we express / in the form of (|46l) . then 
4>{w, {x,y)) = rnpp{yup- x) and A = 77. It is easy to verify that Assumption iBl holds with 

Vo = y, G = r]B, L = t]{B^ + X), M = y{p-2)B^. 


If we choose p = 2 -|- log(l/7), then applying Theorems 0] and [5] yields the following result. 

Corollary 6. For the smoothed hinge loss (pp defined in ([8]) with p = 2-|-log(l/7), the total number 
of communication rounds required by DiSCO to find an e-optimal solution is bounded by 


E[T] - o((log(l/6) + (1 + 


Thus, the smoothed hinge loss enjoys the same communication efficiency as the logistic loss. 
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Dataset name 

number of samples 

number of features 

sparsity 

Govtype 

581,012 

54 

22% 

RCVl 

20,242 

47,236 

0.16% 

News20 

19,996 

1,355,191 

0.04% 


Table 2: Summary of three binary classification datasets. 


6 Numerical experiments 

In this section, we conduct numerical experiments to compare the DiSCO algorithm with several 
state-of-the-art distributed optimization algorithms: the ADMM algorithm (e.ff., m), the acceler¬ 
ated full gradient method (AFG) [SH Section 2.2], the L-BFGS quasi-Newton method {e.g., [371 
Section 7.2]), and the DANE algorithm [47] . 

The algorithms ADMM, AFG and L-BFGS are well known and each has a rich literature. In 
particular, using ADMM for empirical risk minimization in a distributed setting is straightforward; 
see [3 Section 8]. For AFG and L-BFGS, we use the simple distributed implementation discussed 
in Section 1 1.1 1 at each iteration k, each machine computes the local gradients /[{wk) and sends 
it to the master machine to form f'{wk) = (1/^) /i master machine executes 

the main steps of the algorithm to compute w^+i- The iteration complexities of these algorithms 
stay the same as their classical analysis for a centralized implementation, and each iteration usually 
involves one or two rounds of communication. 

Here we briefly describe the DANE (Distributed Approximate NEwton) algorithm proposed 
by Shamir et al. mi- Each iteration of DANE takes two rounds of communication to compute 
Wk+i from Wk- The first round of communication is used to compute the gradient f'{wk) = 
(1/m) Then each machine solves the local minimization problem 

Ufe+pi = arg min \ fiiw) - {fl{wk) - f{wk),w) + ^\\w - Wk\\l \ , 

and take a second round of communication to compute Wk+i = Here /U > 0 

is a regularization parameter with a similar role as in DiSGO. For minimizing the quadratic loss 
in (I4ip . the iteration complexity of DANE is 0((L/A)^n“^ log(l/e)). As summarized in Table [H 
if the condition number L/X grows as y/mn, then DANE is more efficient than AFG and ADMM 
when n is large. However, the same complexity cannot be guaranteed for minimizing non-quadratic 
loss functions. According to the analysis in m. the convergence rate of DANE on non-quadratic 
functions might be as slow as the ordinary full gradient descent method. 

6.1 Experiment setup 

For comparison, we solve three binary classihcation tasks using logistic regression. The datasets are 
obtained from the LIBSVM datasets m and summarized in Table [2j These datasets are selected 
to cover different relations between the sample size N = mn and the feature dimensionality d: 
N ^ d (Govtype [6]), A" « d (RCVl [25]) and A <C d (News20 [22l [23]). For each dataset, our 
goal is to minimize the regularized empirical loss function: 

1 ^ 

iiw) = + exp{-yi{w^xi))) + |||rc||i 

i=l 
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m 


Covtype 


RCVl 


News20 


4 


16 


64 











Figure 2; Comparing DiSCO with other distributed optimization algorithms. We splits each dataset 
evenly to m machines, with m G {4,16,64}. Each plot above shows the reduction of the logarith¬ 
mic gap logio(-^('ul) — (the vertical axis) versus the number of communication rounds (the 

horizontal axis) taken by each algorithm. 


where Xi G and yi G {—1,1}. The data have been normalized so that ||xj|| = 1 for all i = 
1,..., The regularization parameter is set to be 7 = 10“®. 

We describe some implementation details. In Section 15.21 the theoretical analysis suggests that 
we scale the function l{w) by a factor y = /{A'y). Here we have B = 1 due to the normalization 

of the data. In practice, we find that DiSCO converges faster without rescaling. Thus, we use 
7 = 1 for all experiments. For Algorithm [31 we choose the input parameters y, = where 

yo is chosen manually. In particular, we used yo = 0 for Covtype, yo = 4 x 10“^ for RCVl, and 
yo = 2x lO”'^ for News20. For the distributed PCG method (Algorithm [2]) , we choose the stopping 
precision e*, = ||/'(u;fc)II 2 /IO. 

Among other methods in comparison, we manually tune the penalty parameter of ADMM and 
the regularization parameter y for DANE to optimize their performance. For AFC, we used an 
adaptive line search scheme [35l [28] to speed up its convergence. For L-BFGS, we adopted the 
memory size 30 (number of most recent iterates and gradients stored) as a general rule of thumb 
suggested in m, 

We want to evaluate DiSCO not only on Wk, but also in the middle of calculating Vk, to show 
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RCVl News20 


Figure 3: Comparing the sensitivity of DiSCO and DANE with respect to the regularization pa¬ 
rameter /i, when the datasets are split on m = 16 machines. We varied ^ from 10“^ to 128 x 10“^. 
The vertical axis is the logarithmic gap log]^Q(£({c) — ■^(rc*)) after 40 rounds of communications. 


its progress after each round of communication. To this end, we follow equation (jl6p to define an 
intermediate solution for each iteration t of the distributed PCG method (Algorithm [2]) : 


^]c ^k 1 /o 5 

1 + a ."' 

and evaluate the associated objective function This function value is treated as a measure 

of progress after each round of communication. 


6.2 Performance evaluation 

It is important to note that different algorithms take different number of communication rounds per 
iteration. ADMM requires one round of communication per iteration. For AFG and L-BFGS, each 
iteration consists of at least two rounds of communications: one for finding the descent direction, 
and another one or more for searching the stepsize. For DANE, there are also two rounds of 
communications per iteration, for computing the gradient and for aggregating the local solutions. 
For DiSCO, each iteration in the inner loop takes one round of communication, and there is an 
additional round of communication at the beginning of each inner loop. Since we are interested 
in the communication efficiency of the algorithms, we plot their progress in reducing the objective 
value with respect to the number of communication rounds taken. 

We plot the performance of ADMM, AFG, L-BFGS, DANE and DiSCO in Figure[2l According 
to the plots, DiSCO converges substantially faster than ADMM and AFG. It is also notably faster 
than L-BFGS and DANE. In particular, the convergence speed (and the communication efficiency) 
of DiSCO is more robust to the number of machines in the distributed system. For m = 4, the 
performance of DiSCO is somewhat comparable to that of DANE. As m grows to 16 and 64, 
the convergence of DANE becomes significantly slower, while the performance of DiSCO only 
degrades slightly. This coincides with the theoretical analysis: the iteration complexity of DANE 
is proportional to m, but the iteration complexity of DiSCO is proportional to 

Since both DANE and DiSCO take a regularization parameter ^u, we study their sensitivity to 
the choices of this parameter. Eigure[3]shows the performance of DANE and DiSCO with the value 
of ^ varying from 10“^ to 128 x 10“^. We observe that the curves of DiSCO are relatively smooth 
and stable. In contrast, the curves of DANE exhibit sharp valley at particular values of /x. This 
suggests that DiSCO is more robust to the non-optimal choice of parameters. 
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7 Extension to distributed composite minimization 


Thus far, we have studied the problem of minimizing empirical loss functions that are standard 
self-concordant. In this section, we sketch how to extend the DiSCO algorithm to solve distributed 
composite minimization problems. By composite minimization, we consider the minimization of 

F{w) = f{w) + T(u;), (47) 


where / is a standard self-concordant function taking the form of m, and T a closed convex 
function with a simple structure (see discussions in |35]). For solving the Lasso |39], for example, 
the £i-penalty ^(w) = cr||'u;||i with cj > 0 is nonsmooth but admits a simple proximal mapping. 

We modify Algorithm [T] and Algorithm [2] to minimize the composite function F(w). To modify 
Algorithm [H we update w^+i using an inexact version of the proximal-Newton method {e.g., 
[MIEq]). More specifically, the two steps in each iteration of Algorithm [1] are replaced with: 

1. Find a vector Vk that is an approximate solution of 


minimize 




v'^fiwk) + ^{Wk 



(48) 


2. Update Wk+i = Wk - , , , , • 

Note that for 4'(rc) = 0, the above proximal-Newton method reduces to Algorithm [TJ Since Vk only 
needs to be an inexact solution to problem (|48p . we need a measure to quantify the approximation 
error. For this purpose, we define the following gradient mapping 

g{vk) arg mm\^\\g\\l +{f'{wk)vk - fiwk),g)+^iwk - Vk+g)\. 

aeRd- 12 J 


If Vk is an exact minimizer of ()48]) . then we have 115(^^)112 = 0. In the distributed setting, we only 
need to find a vector Vk such that || 5 (ufc )||2 < e^. 

It remains to devise an distributed algorithm to compute an inexact minimizer Vk- Since the 
objective function in ([48]) is not quadratic, we can no longer employ the distributed PCG method 
in Algorithm [21 Instead, we propose a preconditioned accelerated proximal gradient method. In 
particular, we modify the algorithm on the master machine in Algorithm [2] as follows: 


V 


(*+i) — 


arg min |]-{v - s^^'^Y'[f”{wk) + ^lI]{v - s^*^) 

+ {f"iwk)s'^^^ - f'{wk),v - -k '^{Wk + u)|. 


,h+i) ^ ^(t+1) ^ \/^ + ^/^A-i /„.(t+i) 


-k 


Y^l -k 2/r/A -k 1 


(49) 


where is an auxiliary vector. We output Vk = once the condition || 5 (r ’^*^^^)||2 < Cfc is 

satisfied. Each update takes one round of communication to compute the vector f"{wk)s^^\ Then, 
the sub-problem (|49p is locally solved by the master machine. This problem has similar structure 
as problems ([20]) and ([26p . and can be solved in 0((n -k ^^)log(l/e)) time using the methods 
proposed in [441ICTITl] . 
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If we replace the first term on the right-hand side of equation (j^Qll by a-iid set 

fj, = L, then the above algorithm is exactly the accelerated proximal gradient algorithm [351128j . 
which converges in 0{^/lJX) iterations. By utilizing the similarity between /f(tCfc) and f''{wk), 
and assuming \\fi{wk) — f''{wk )\\2 < for all /c > 0, it can be shown that our algorithm in (j49p 
converges in 0(1 -|- y^/r/A) iterations, which is of the same order as the PCG algorithm. 

In summary, to minimize the composite function f{w) + ^{w), we replace Algorithm [T] by the 
inexact proximal Newton method, and replace Algorithm [2] by a distributed implementation of the 
above preconditioned accelerated proximal gradient method. Under the same assumptions on /, 
we can obtain similar guarantees on the communication efficiency as stated in Theorems 0] and [5j 

8 Conclusions 

We considered distributed convex optimization problems originated from SAA or ERM, which 
involve large amount of i.i.d. data stored on a distributed computing system. Since the cost of inter¬ 
machine communication is very high in practice, communication efficiency is a critical measure in 
evaluating the performance of a distributed algorithm. For algorithms based on first-order methods, 
including accelerated gradient methods and ADMM, the required number of communication rounds 
grows with the condition number of the objective function. The condition number itself often grows 
with the number of samples due to weaker regularization required. This causes the total number 
of communication rounds to grow with the overall sample size. 

In this paper, we proposed and analyzed DiSCO, a communication-efficient distributed algo¬ 
rithm for minimizing self-concordant empirical loss functions, and discussed its application to linear 
regression and classification. DiSCO is based on an inexact damped Newton method, where the 
inexact Newton steps are computed by a distributed preconditioned conjugate gradient method. In 
a standard setting for supervised learning, its required number of communication rounds does not 
increase with the sample size, but only grows slowly with the number of machines in the distributed 
system. There are three main thrusts in our approach: 

• Self-concordant analysis. We showed that several popular empirical loss functions used in 
machine learning are either self-concordant or can be well approximated by self-concordant 
functions. We gave complexity analysis of the inexact damped Newton method, and charac¬ 
terized the conditions for both linear and superlinear convergence. 

• Preconditioned conjugate gradient (PCG) method. We proposed a distributed implementation 
of the PCG method for computing the inexact Newton step. In particular, the preconditioner 
based on similarity between local and global Hessians is very effective in reducing the number 
of communication rounds, both in theory and practice. 

• Stochastic analysis of communication efficiency. Our main theoretical results combine two 
consequences of averaging over a large number of i.i.d. samples. One is the expected reduction 
of the initial objective value, which counters the effect of objective scaling required to make 
the objective function standard self-concordant. The other is a high-probability bound that 
characterizes the similarity between the local and global Hessians. 

Our numerical experiments on real datasets confirmed the superior communication efficiency of the 
DiSCO algorithm. In addition, we also proposed an extension for solving distributed optimization 
problems with composite empirical loss functions. 
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Appendices 


A Proof of Theorem [T] 

First, we recall the definitions of the two auxiliary functions 

uj{t) = t — log(l + t), t > 0, 
io*{t) = —t — log(l — t), 0 < t < 1, 

which form a pair of convex conjugate functions. 

We notice that Step 2 of Algorithm [1] is equivalent to 


Wk+l -Wk = 


Vk 


Vk 


1 + 5k 1 + 


which implies 


\\[f'{Wk)]^^‘^{Wk+l-Wk)h = 


< 1 . 


1 + ir'^feii2 

When inequality (1501) holds, Nesterov [341 Theorem 4.1.8] has shown that 

f{Wk+l) < f{Wk) + {f'{Wk),Wk+l - Wk) +UJ^{\\[f"{Wk)]^^‘^{Wk+l - Wk)\\2)- 
Using the definition of functions ui and w*, and with some algebraic operations, we obtain 


f{Wk+l) < f{wk) - 


{Uk,Vk) 


1 + Ipfclb 1 + iTfclb 
= f{wk) - wdISfclb) + (t^dl^fclb) - ui 

By the second-order mean-value theorem, we have 

wdl^fclb) -uiiWvkh) =oj' 

for some t satisfying 


+ log(i -I- Wvkh) 

{Vk - Uk,Vk) 


0) + 


1 + 


,) + -u;"{t){\\ukh-\\M2) 


min{||nfc||2, < t < max{||nfc||2, Wvkh}- 

Using the inequality (fTT]l . we can upper bound the second derivative co"{t) as 

1 1 


co"it) = 


1 1 

< - < 


< 


l + t 1-h min{||nfc|| 2 , ||ufc|| 2 } I + {1 -/3)\\uk\\2' 


Therefore, 


UI 


)) - W 


2 = 


1 + 


+ U"(*)(lfe|b-||Bt|b) 


^ \\uk - Vkhhkh ^ (l/2)||Mfc - 


< 


1 -I- (1 - 


2 1 + {1 -/3)\\Uk\\2 

2 
2 


1 -h (1 - f3)\\uk\\2 


(50) 


(51) 
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In addition, we have 


{Vk - Uk,Vk) ^ \\uk - ^ /3(1 + /3)||nfc|li 

l + ll'f^fclb 1 + Il'Ufclb 1 + (1 ~/3)||iffc||2 

Combining the two inequalities above, and using the relation t^/(l + t) < 2uj{t) for all t > 0, we 
obtain 


< 

< 

In the last inequality above, we used the fact that for any t > 0 we have a;((l — /3)f) < (1 — /3)w(t), 
which is the result of convexity of co{t) and a;(0) = 0; more specifically, 

a;((l - l3)t) = uj{j3 • 0 + (1 - /3)t) < /3a;(0) + (1 - /3)uj{t) = (1 - I3)uj{t). 

Substituting the above upper bound into inequality (f^ yields 

f{wk+i) < fiwk) “ (^1 “ (52) 

With inequality (|52l) . we are ready to prove the statements of the lemma. In particular. Part (a) 
of the Lemma holds for any 0 < /3 < 1/10. 

For part (b), we assume that ||Mfc ||2 < 1/6. According to [Ml Theorem 4.1.13], when Ijufclb < 1, 
it holds that for every A; > 0, 


(2/3(1 + /3) + (1/2)/32) 

M + (1 - P)\\Uk\\2 

/2/3 + (5/2)/32\ (1-/3)2||S,||2 

V (l-/3)2 )i + {i-p)\\ukh 


ujiWukh) < f{wk) - f{w^) < uiMWkh)- (53) 

Combining this sandwich inequality with inequality (|52|) . we have 
w(||Ufc+l||2) < fiWk+l) - f{w^) 

< f{wk) - f{w^) - wdl^fclb) + ^dl^fclb) 

< ^^*i\\ukh) - iviWukh) + ^^^^-wdjufclb)- (54) 

It is easy to verify that a;*(t) — uj{t) < 0.26a;(t) for all t < 1/6, and (4/3 + 5/3^)/(l — /3) < 0.23 if 
/3 < 1/20. Applying these two inequalities to inequality ([Ml) completes the proof. 

It should be clear that other combinations of the value of /3 and bound on Ijufclb are also possible. 
For example, for /3 = 1/10 and ||nfc ||2 < 1/10, we have a;(||ufc+i|| 2 ) < 0.65u;(||nfc||2). 
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B Super-linear convergence of Algorithm [T] 


Theorem 6. Suppose f : M is a standard self-concordant function and Assumption m holds. 

If we choose the sequence {efc}A:>o Algorithm\^ as 

. [a;(rfc) uj^/‘^{rk)\ ^- 1 / 211 ^/^ mi 

efc = ^-min<^—,-—S, where rk = L ' \\f [wkjh, (55) 

then: 

(a) For any k>0, we have fiwk+i) < fiwk) - ^uiWukh)- 

(b) If \\ukh < 1/8, then we have wdlSfc+ilb) < \/6 

Part (b) suggests superlinear convergence when ||ttfc ||2 is small. This comes at the cost of a 
smaller approximation tolerance Ck given in (|55p . compared with (|10ll . Roughly speaking, when 
||/^(^'^fc )||2 is relative large, the tolerance in (f55]l needs to be proportional to ||/^('W^fc)|| 2 ^^ since 
u}{t) = 0{t). When ||/'(rcfc )||2 is very small, the tolerance in (l55]) needs to be proportional to 
||/'(tt)fc )||2 because Lo{t) ~ as t ^ 0. In contrast, for linear convergence, the tolerance in (fTOl) is 
proportional to ||/'(tCfc) || 2 - 


Proof. We start with the inequality (I5ip . and upper bound the last two terms on its right-hand 
side. Since uj'{t) = < 1, we have 

^iWukh) -wd|ufc||2) < lllufclb- ll^ifelbl < \\uk -Vkh- 


In addition, we have 


{Vk - Uk,Vk) 


< 


\uk - Vkh < \\uk - Vkh- 


l + lTfclb l + \\Vk\\2 

Applying these two bounds to (ISip . we obtain 

/(■w^fc+i) < fiwk) - u:{\\ukh) + 2||Sfc - Vkh- 

Next we bound \\uk — Ufc ||2 using the approximation tolerance specified in ([^Sp . 

Il^fc - Ufc||2 = [f''{Wk)]~^^‘^f'{Wk) - [f"iWk)]^^‘^Vk 

= [f''{Wk)]~^^‘^{f''{Wk)Vk - fiwk)) 

< A"^/^ \\f''{wk)vk - f{wk)\\.^ 

< A-^/^6, 


= — mm 
2 


ujjrk) \ 

. 2 ’ 10 /• 


(56) 


Combining the above inequality with (f56p . and using rk = L ^'^^||/'(tCfc )||2 < \\vkh with the 
monotonicity of a;(-), we arrive at 


f{wk+i) < f{wk) - wdlufclb) + min 


^{Uk) UJ^^'^jUk) 
2 ’ 10 


(57) 
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Part (a) of the theorem follows immediately from inequality (I57p . 

For part (b), we assume that ||ttfc ||2 < 1/8. Combining (l53]l with (IFH) . we have 


w(||Ufc+i||2) < f{wk+i) - /(w*) < f{wk) - f{w^) - w 




OJ 


3 / 2(11 


Uk\\2) 


10 


< CO. 


2 -(O 


l) + 


10 


(58) 


Let h{t) :=co^{t) — co{t) and consider only t > 0. Notice that h{0) = 0 and h'{t) = 
for t < 1/8. Thus, we conclude that h{t) < for t < 1/8. We also notice that a;(0) = 0 and 
^ lor 1 ^ 1/8. Thus, we have co{t) > for t < 1/8. Combining these results, we 

obtain 


(o^{t) 


, , 128 o 


128 


(^ 2 ) 3/2 ^ 


128 

1^ 





a;3/2(t). 


Applying this inequality to the right-hand side of (f58]l completes the proof. 


□ 


In classical analysis of inexact Newton methods [laile], asymptotic superlinear convergence 
occurs with r\j \\f'{wk)\\ 2 ^^ (in fact with ||/'(r(;fc)||| for any s > 1). This agrees with 
our analysis since Lo{t) = 0{t) when t is not too small. Our result can be very conservative 
asymptotically because co{t) ~ as t —)• 0. However, using Lo{t) and the associated self-concordance 
analysis, we are able to derive a much better global complexity result. 

Corollary 7. Suppose / : ^ M is a standard self-concordant function and Assumption [A] 

holds. If we choose the sequence {cfc} in Algorithmic as in ([5^ . then for any e < l/(3e), we have 
f{wk) — fi'Wi,) < e whenever 


k > 


fjwo) - fjw*) 
l^(l/ 8 ) 


+ 


loglog(l/(3e)) 

log(3/2) 


(59) 


where [t] denotes the smallest nonnegative integer that is larger than or equal to t. 

Proof. By part (a) ofTheoremlU ifa;(||ufc|| 2 ) > 1/8, then each iteration of Algorithm [T] decreases the 




iterations, 


function value at least by the constant ^a;(l/ 8 ). So within at most Ki := 
we are guaranteed to have ||Mfc ||2 < 1 / 8 . 

Part (b) of Theorem [ 6 ] implies bwdlufc-i-ilb) < ( 6 a;(||ufc|| 2 ))^^^ when ||ufc ||2 < 1/8, and hence 

k-Ki 


log(6a;(||uA:||2)) < 


log ( 6 a;(l/ 8 )) , k>Ki. 


Note that both sides of the above inequality is negative. Therefore, after k > Ki + 
iterations (assuming e < l/(3e)), we have 

log( 6 a;(||ufc|| 2 )) < log(l/(3e)) log (6 w(l/ 8 )) < - log(l/(3e)), 

which implies a;(||ttfc|| 2 ) < e/2. Finally using (f53]l and the fact that uj^(t) < 2w(t) for t < 1/8, we 
obtain 

f{wk) - fiw*) < w^dlufclb) < 2 a;(||ufc|| 2 ) < e. 


This completes the proof. 


□ 
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C Proof of Lemma |4] 


It suffices to show that the algorithm terminates at iteration t < T^ — 1, because when the algorithm 
terminates, it outputs a vector Vk which satisfies \\Hvk — f'{wk )\\2 = < Cfc- Denote by 

V* = H~^f'{wk) the solution of the linear system Hvk = f'{wk)- By the classical analysis on the 
preconditioned conjugate gradient method (e.g., [29l[2]), Algorithm [2] has the convergence rate 

(uW - -v*)<a( M (60) 

V v'^ +1/ 

where k = 1 + 2^/\ is the condition number of P~^H given in (I24p . For the left-hand side of 
inequality (f60|) . we have 

iiT-(bii2 

(uW - - V*) = (rW)^Fr"VW > 

L 

For the right-hand side of inequality ([60|) ' we have 

{v*yhv* = ir{wk)fH-^fiwk) < 

Combining the above two inequalities with inequality (1601) . we obtain 


^(t) 


2 



VaA +1/ 


II/'A;^)I|2<2 



A 


A -|- 2^ 


t 


\\n^k)h. 


To guarantee that ||r (*)||2 < e^, it suffices to have 






where in the last inequality we used — log(l — x) > x for 0 < x < 1. Comparing with the definition 
of T^, this is the desired result. 


D Proof of Lemma [5] 


First, we prove inequality (f5T]l . Recall that tc* and Wi minimizes f{w) and fi{w) + 
both function are A-strongly convex, we have 


w-,\\l < f{w^) < /(O) < Vo, 

\m\\l < fiim) + ^\\wi\\l < /i(0) < Vo, 


£11 
2 II 


2 - Since 


which implies ||tc *||2 < 
average over 




Then inequality dSU) follows since wo is the 
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In the rest of Appendix [Dl we prove inequality (|32l) . Let z be a random variable in C 
with an unknown probability distribution. We define a regularized population risk: 

R{w) = E:,[(j){w, z)] + 

Let S be a set of n i.i.d. samples in Z from the same distribution. We define a regularized empirical 
risk 

1 , A + Pii ||2 

l2) 


rsiw) = - + 

n 




z(iS 


and its minimizer 


ws = argmin rs{w). 


The following lemma states that the population risk of ws is very close to its empirical risk. The 
proof is based on the notion of stability of regularized empirical risk minimization [7]. 

Lemma 7. Suppose AssumvtionW\ holds and S is a set of n i.i.d. samples in Z. Then we have 

2G^ 


]Es[i2(uis) - rsiws)] < 


pn 


Proof. Let S = {zi,..., Zn}- For any k E n}, we define a modified training set by 

replacing Zk with another sample Zk, which is drawn from the same distribution and is independent 
of (S'. The empirical risk on is defined as 


(fc)/ X 1 \ : ^~^P\ 

M = - nw,z) + ^-\ 


Mil- 


z&SW 


and let = argmin^ Since both rs and are p-strongly convex, we have 


rsiwg^) - rs{ws) > - wsW'i 


rs\^s) - rPiwP) > - wsWl 


Summing the above two inequalities, and noticing that 


rs{w) - rP{w) = -{4>{w,Zk) - cj){w,Zk)), 


1 


n 


we have 


w 


(k) 


- wslli < ^ - (fiwf\zk) - (j){ws,Zk) + (fiws,Zk)^ . 

By Assumption iBl fiii and the facts ||'«i 5||2 < y^2Vb/A and ii4"^i|2 < V^W^, we have 

\(f{w^g\z) - (j){ws,z)\ < G||4^ - u;s|| 2 , Vz e Z. 

Combining the above Lipschitz condition with (I6ip . we obtain 

II ^(fc) ^ ||2 ^ ^ II 

\\Ws - 2 < - \\Ws - 2- 

pn 


(61) 
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As a consequence, we have — tC 5||2 < and therefore 

0^2 

\(f){w^g\z) - (f){ws,z)\ < -, Mz^Z. (62) 

' ' pn 

In the terminology of learning theory, this means that empirical minimization over the regularized 
loss rs{w) has uniform stability 2G'^j{pn) with respect to the loss function (f; see [7]. 

For any fixed A; € {1,... , n}, since Zk is independent of S, we have 


Es[R{ws) - 


E 5 


^lk[^{ws,Zk)] 


1 X ^ 

-'^(f{ws,Zj) 

n 

j=i 


^s,zk [(p{ws, Zk) - (l){ws, Zk)] 
^S,Zk [(f>{ws, Zk) - (l){wP,Zk )], 


where the second equality used the fact that Es[4>{ws, zj) has the same value for all j = 1,... ,n, 
and the third equality used the symmetry between the pairs {S,Zk) and {S^^\zk) (also known as 
the renaming trick; see [71 Lemma 7]). Combining the above equality with (1621) yields the desired 
result. □ 


Next, we consider a distributed system with m machines, where each machine has a local dataset 
Si of size n, for z = 1,... ,m. To simplify notation, we denote the local regularized empirical loss 
function and its minimizer by rj(tc) and Wi, respectively. We would like to bound the excessive 
error when applying Wi to a different dataset Sj. Notice that 

[rjiwi) - rj{wj)] = Esi,Si [rjiwi) - ri{wi)] +Es^^Sj [riiwi) - rj{wR)] +E 5 . [rj{wR) - rj{wj)] 

' -V-' '-V-' '-V-' 

Vl V 2 VZ 

(63) 


where wr is the constant vector minimizing R{w). Since Si and Sj are independent, we have 

2G^ 

Vl = E5. [Es'Jrj(u;i)] - rj(u;j)] = E5. [R{wi) - ri(u;i)] < , 

where the inequality is due to Lemma [71 For the second term, we have 

V2 = E5, [ri{wi) - Es^.[rj(uiR)]] = E^, [ri{wi) - ri{wR)] < 0 . 

It remains to bound the third term U 3 . We first use the strong convexity of rj to obtain (e.g., |34l 
Theorem 2.1.10]) 


rj{wR) - TjiWj) < 


\r'iwR)\\l 

2p ■ 


(64) 


where r'AwR) denotes the gradient of rj at wr. If we index the elements of Sj hy zi,..., Zn, then 


r'jiwR) 


1 X ^ 

“ X] ^k) + (A + p)wr) . 

^ k=l 


(65) 
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By the optimality condition of wn = argmin^„ Riw), we have for any k £ {1,... ,n}, 

[(t)'{wR, Zk) + (A + p)wb\ = 0. 

Therefore, according to (1651) . the gradient rj{wR) is the average of n independent and zero-mean 
random vectors. Combining (1641) and (1651) with the definition of ^3 in ()63p . we have 


V3 < 


ESj [ELi WiwR, Zk) + {X + p)wr\ 


< 


2 pin? 

Y2=i [\W{WR, Zk) + (A + p)wr\\2] 

2pn‘^ 

Ylk=lHWiwR,Zk)\\l] 


2pn^ 


< 


2pn 


In the equality above, we used the fact that ())'{wR,Zk) + (A -|- p)wr are i.i.d. zero-mean random 
variables; so the variance of their sum equals the sum of their variances. The last inequality above 
is due to Assumption iBl (ii) and the fact that ||uii ?||2 < ■\/2Vq/{\ -|- p) < i/2Vb/A. Combining the 
upper bounds for vi, V 2 and U 3 , we have 


^Si,Sj [rj{wi) - rj{wj)] < 


3G^ 

pn 


( 66 ) 


Recall the definition of f{w) as 


rri n 

f{w) = - '^'^4>{w,Zi^k) + -rWMII: 

mn 2 

i=l k=l 


where Zi^k denotes the kth sample at machine i. Let r{w) = — YlJLi c(^)’ then we have 

r{w) = f{w) + ^\\w\\l. 

We compare the value r{wi), for any i £ {1,..., m}, with the minimum of r{w): 


^ m ^ m 

r{wi) — minr(t(;) = — > fj^Wi) — min — > r,- 

w m z-^ w m Z-^ 


[w 


i=i 


i=i 


^ rn ^ rri 

< — y rj{wi) - y minrRtc) 

m z-^ rn z-^ w 

i=i i=i 


1 


m 


• 


i=i 


Taking expectation with respect to all the random data sets ^i,..., Sm and using 

1 - ^, 3G^ 

E[r(u;j) — min r(u))] < — — rj{'Wj)] < 


j=i 


pn 


(67) 


we obtain 

( 68 ) 
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Finally, we bound the expected value of /{wi): 




+ 


3G2 


pn 


< E 


f{w*) + 


+ 


3G^ 

pn 


and the 


<E[/K)] + ^ + —, 

2 pn 

where the first inequality holds because of (l67|) . the second inequality is due to 
last inequality follows from the assumption that E[||t(;*|| 2 ] < . Choosing p = results in 

K[f{wi) — f{w-k)] < for every i G {1,..., m}. Since wq = ^ using the convexity of 

function / yields E[/(t(;o) — /(w^*)] < ; which is the desired result. 


E Proof of Lemma [6] 

We consider the regularized empirical loss functions fi{w) defined in ([30]) . For any two vectors 
u,w satisfying ||u — w \\2 < e, Assumption [B] (iv) implies 

ii/;»-/rHii2<Me. 

Let B{0, r) be the ball in with radius r, centered at the origin. Let r)) be the covering 

number of B{0,r) by balls of radius e, i.e., the minimum number of balls of radiusr e required to 
cover B{0,r). We also define Nf^^{B{0,r)) as the packing number of B(0,r), i.e., the maximum 
number of disjoint balls whose centers belong to B{0,r). It is easy to verify that 

N^°^B{0,r)) < N^^^{B{0,r)) < {l + 2r/ef. 

Therefore, there exist a set of points U C with cardinality at most (1 + 2r je)'^, such that for 
any vector w G B{0,r), we have 


min Wfiiw) - fiiu )\\2 < Me. (69) 

uGU 

We consider an arbitrary point u G U and the associated Hessian matrices for the functions 
fi{w) defined in (1301) . We have 


fiiu) = + >^1) ^ i = l,...,m. 

^ i=i 

The components of the above sum are i.i.d. matrices which are upper bounded by LI. By the 
matrix Hoeffding’s inequality |3n( Corollary 4.2], we have 

P[||/"(u)-E[/"(u)]||2>t] <d-e-^. 
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Note that K[f”{w)] = E[/"(t/;)] for any w E B{0,r). Using the triangular inequality and inequal¬ 
ity (IMD, we obtain 


||/('H -/"H]||2 < ||/('M -E[/rM]||2 + ||/"M -e[/"H]||2 

<2 max ||/"H-E[/"H ]||2 

<2 max (max||/"(u)-E[/"(n)]|| 2 -FMe). (70) 

ie{i,•••,"*} V ueu J 


Applying the union bound, we have with probability at least 

1 — md(l + 2r/e)“ • e 

the inequality ||/f (u) — E[/"(u)] ||2 < t holds for every i G {1,..., m} and every u G U. Combining 
this probability bound with inequality (fTOjl . we have 


P 


sup Wfiiw) - f”{w) 

*■ wGB{0,r) 


I 2 > 2t + 2Me 


d 


< md (1 -|- 2r/£) • e 


2 L 2 


(71) 


As the final step, we choose e = and then choose t to make the right-hand side of inequal¬ 

ity (ITTI) equal to 5. This yields the desired result. 


F More analysis on the number of PCG iterations 


Here we analyze the number of iterations of the distributed PCG method (Algorithm [2]) when ^ is 
misspecified, i.e., when used in P = Hi + is not an upper bound on \\Hi — H\\ 2 . For simplicity 
of discussion, we assume that Assumption lAl holds. \\Hi — H \\2 < L and /i < L. In this case, we 
can show (using similar arguments for proving Lemma [S]): 

1 2L 

^max 

L/ -r fJj 
-L fl A 

Hence the condition number of the preconditioned linear system is 


Kfi,L = 


2L 

T 


1 + 


L fi 


<2 + 


2L 

X’ 


and the number of PCG iterations is bounded by (c/. Appendix [Cj) 


vXXiog 




This gives the bound on number of PGG iterations in (j36n . 
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